Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, a decomposition method that is aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Furthermore, we designed a metric to predict when SVD may introduce a significant performance drop, for which our method can serve as a rescue strategy. Extensive evaluations demonstrate that our method can perform better than current state-of-the-art (SOTA) methods in compressing Transformer-based language models.
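As a rough illustration of why importance weighting removes the closed-form solution, the sketch below (a hypothetical example, not the paper's implementation) minimizes an element-weighted reconstruction error with alternating least squares, one of the optimization strategies such a problem typically calls for; all names and the choice of solver are assumptions.

```python
import numpy as np

def weighted_lowrank(A, W, rank, iters=50):
    """Alternating least squares for min_{U,V} ||sqrt(W) * (A - U @ V)||_F^2.

    A: (m, n) target matrix; W: (m, n) non-negative importance weights.
    Unlike plain SVD, element-wise weights make the problem non-convex,
    so we alternate row-wise / column-wise weighted least squares.
    """
    m, n = A.shape
    U = np.random.randn(m, rank)
    V = np.random.randn(rank, n)
    for _ in range(iters):
        # Each row of U has its own weighted least-squares problem.
        for i in range(m):
            Wi = np.diag(W[i])
            U[i] = np.linalg.solve(V @ Wi @ V.T + 1e-8 * np.eye(rank),
                                   V @ Wi @ A[i])
        # Likewise for each column of V.
        for j in range(n):
            Wj = np.diag(W[:, j])
            V[:, j] = np.linalg.solve(U.T @ Wj @ U + 1e-8 * np.eye(rank),
                                      U.T @ Wj @ A[:, j])
    return U, V

A = np.random.randn(32, 16)
W = np.random.rand(32, 16)          # e.g. per-parameter importance scores
U, V = weighted_lowrank(A, W, rank=4)
```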
Factorizing a large matrix into smaller matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving larger reconstruction errors to those parameters that affect task accuracy more. In other words, the optimization objective of SVD is not aligned with the task accuracy of the trained model. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of the parameters affecting the model prediction. This idea leads to our method: Fisher-weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that the resulting task accuracy is much closer to the original model's performance. We perform analysis with Transformer-based language models, showing that our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance at higher compression rates. Our method can directly compress task-specific models while performing better than other compact model strategies that require expensive model pre-training. Moreover, the evaluation of compressed models reveals that our method can further reduce 9% to 30% of parameters with an insignificant impact on task accuracy.
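A minimal sketch of the row-weighting idea described above, assuming the per-parameter Fisher estimates have already been accumulated from task-loss gradients; the aggregation to row-level scores and the factor names are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """Approximate W with rank-k factors, weighting rows by Fisher information.

    W:      (m, n) trained weight matrix.
    fisher: (m, n) per-parameter Fisher estimates, e.g. accumulated squared
            gradients of the task loss (assumed precomputed).
    """
    # Aggregate importance per row so a closed-form SVD still applies.
    row_imp = np.sqrt(fisher.sum(axis=1, keepdims=True)) + 1e-8   # (m, 1)
    U, S, Vt = np.linalg.svd(row_imp * W, full_matrices=False)
    # Undo the row scaling on the left factor; keep only the top-k components.
    A = (U[:, :rank] * S[:rank]) / row_imp    # (m, k)
    B = Vt[:rank]                             # (k, n)
    return A, B   # W ≈ A @ B, biased toward rows with high Fisher importance
```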
Domain classification is a fundamental task in natural language understanding (NLU), which often requires fast accommodation to new, emerging domains. This constraint makes it impossible to retrain all previous domains, even if they are accessible to the new model. Most existing continual learning approaches suffer from low accuracy and performance fluctuation, especially when the distributions of old and new data are significantly different. In fact, the key real-world problem is not the absence of old data, but the inefficiency of retraining the model with the whole old dataset. Is it possible to utilize some old data to yield high accuracy and maintain stable performance, while at the same time not introducing extra hyperparameters? In this paper, we propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments. Specifically, we utilize Fisher information to select examples that can "record" key information of the original model. Also, a novel scheme called dynamic weight consolidation is proposed to enable hyperparameter-free learning during the retraining process. Extensive experiments demonstrate that baselines suffer from fluctuating performance and are therefore useless in practice. On the contrary, our proposed CCFI significantly and consistently outperforms the best state-of-the-art method by up to 20% in average accuracy, and each component of CCFI contributes effectively to the overall performance.
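A toy sketch of exemplar selection by Fisher information as described above; the per-example scoring and top-k selection are assumptions made for illustration, and the dynamic weight consolidation step is omitted.

```python
import torch

def select_exemplars(model, loss_fn, old_data, k):
    """Keep the k old examples whose gradients carry the most Fisher information.

    Scores each example by the summed squared gradient of the loss w.r.t. the
    model parameters (an empirical per-example Fisher estimate), then retains
    the top-k as the replay memory for continual learning.
    old_data: list of (input, label) pairs already batched for the model.
    """
    scores = []
    for x, y in old_data:
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        fisher = sum((p.grad ** 2).sum() for p in model.parameters()
                     if p.grad is not None)
        scores.append(fisher.item())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [old_data[i] for i in top]
```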
It is well known that vision classification models suffer from poor calibration in the face of data distribution shifts. In this paper, we take a geometric approach to this problem. We propose Geometric Sensitivity Decomposition (GSD), which decomposes the norm of a sample feature embedding and the angular similarity to a target classifier into an instance-dependent and an instance-independent component. The instance-dependent component captures sensitive information about changes in the input, while the instance-independent component represents insensitive information that serves solely to minimize the loss on the training dataset. Inspired by the decomposition, we analyze a simple extension to current softmax-linear models, which learns to disentangle the two components during training. On several common vision models, the disentangled model outperforms other calibration methods on standard calibration metrics in the face of out-of-distribution (OOD) data and corruption, while being significantly less complex. Specifically, we surpass the current state of the art by a 30.8% relative improvement in expected calibration error on corrupted CIFAR100. Code is available at https://github.com/gt-ripl/geometric-sensitivity-decomposition.git.
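As a schematic reading of the decomposition (the notation and the additive split shown are assumptions, not the paper's exact formulation), a softmax-linear logit already factorizes geometrically, and the premise is to separate those geometric quantities into a part that tracks the input and a part fit only to the training loss:

$$
z_y(x) \;=\; \lVert w_y \rVert\,\lVert f(x) \rVert \cos\theta_y(x),
\qquad
\lVert f(x) \rVert \cos\theta_y(x) \;\approx\; \underbrace{s_y(x)}_{\text{instance-dependent}} \;+\; \underbrace{c_y}_{\text{instance-independent}}
$$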
Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as the training set, but can degrade significantly otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consisting of small image datasets. However, many recent methods based on neural networks rely on training or tuning with both in-distribution and out-of-distribution data. The latter is generally hard to define a priori, and its selection can easily bias the learning. We base our work on the popular method ODIN [21], proposing two strategies that free it from the need for tuning with OoD data while improving its OoD detection performance. Specifically, we propose decomposed confidence scoring as well as a modified input pre-processing method. We show that both significantly help detection performance. Our further analysis on a larger-scale image dataset shows that the two types of distribution shift, namely semantic shift and non-semantic shift, present a significant difference in problem difficulty, providing an analysis of when ODIN-like strategies do or do not work.
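The sketch below illustrates the spirit of ODIN-style input pre-processing without any OoD tuning data; the perturbation size and the use of the max log-softmax as the perturbed score are illustrative choices, not the paper's exact decomposed scoring.

```python
import torch
import torch.nn.functional as F

def preprocess_input(model, x, eps=0.002):
    """Nudge the input toward higher model confidence (ODIN-style perturbation).

    In-distribution samples typically gain more confidence from this nudge
    than OoD samples, sharpening the separation between the two.
    (eps is an illustrative value, not a tuned hyperparameter.)
    """
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    score = F.log_softmax(logits, dim=-1).max(dim=-1).values.sum()
    score.backward()
    return (x + eps * x.grad.sign()).detach()

def ood_score(model, x):
    """Higher score -> more likely in-distribution."""
    x_tilde = preprocess_input(model, x)
    with torch.no_grad():
        return F.softmax(model(x_tilde), dim=-1).max(dim=-1).values
```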
Accurate and smooth global navigation satellite system (GNSS) positioning for pedestrians in urban canyons is still a challenge due to multipath effects and non-line-of-sight (NLOS) receptions caused by reflections from surrounding buildings. The recently developed factor graph optimization (FGO) based GNSS positioning method opened a new window for improving urban GNSS positioning by effectively exploiting the measurement redundancy from historical information to resist outlier measurements. Unfortunately, FGO-based GNSS standalone positioning is still challenged in highly urbanized areas. As an extension of the previous FGO-based GNSS positioning method, this paper exploits the potential of the pedestrian dead reckoning (PDR) model in FGO to improve GNSS standalone positioning performance in urban canyons. Specifically, the relative motion of the pedestrian is estimated from the raw acceleration measurements of the onboard smartphone inertial measurement unit (IMU) via the PDR algorithm. Then the raw GNSS pseudorange, Doppler measurements, and the relative motion from PDR are integrated using FGO. Given that pedestrian navigation involves small accelerations most of the time, a novel soft motion model is proposed to smooth the states involved in the factor graph model. The effectiveness of the proposed method is verified step by step through two datasets collected in dense urban canyons of Hong Kong using smartphone-level GNSS receivers. A comparison between the conventional extended Kalman filter, several existing methods, and the FGO-based integration is presented. The results reveal that the existing FGO-based GNSS standalone positioning is highly complementary to the PDR's relative motion estimation. Both improved positioning accuracy and trajectory smoothness are obtained with the help of the proposed method.
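A schematic of the factor-graph objective implied by the abstract, with notation assumed rather than taken from the paper: the state trajectory jointly minimizes pseudorange, Doppler, PDR relative-motion, and soft-motion (small-acceleration) residuals.

$$
\chi^{*} \;=\; \arg\min_{\chi}\;
\sum_{t,s}\bigl\lVert r^{\mathrm{pr}}_{t,s}(\chi)\bigr\rVert^{2}_{\Sigma_{\mathrm{pr}}}
\;+\;\sum_{t,s}\bigl\lVert r^{\mathrm{dop}}_{t,s}(\chi)\bigr\rVert^{2}_{\Sigma_{\mathrm{dop}}}
\;+\;\sum_{t}\bigl\lVert (\mathbf{p}_{t+1}-\mathbf{p}_{t})-\Delta\mathbf{p}^{\mathrm{PDR}}_{t}\bigr\rVert^{2}_{\Sigma_{\mathrm{PDR}}}
\;+\;\sum_{t}\bigl\lVert \mathbf{p}_{t+1}-2\mathbf{p}_{t}+\mathbf{p}_{t-1}\bigr\rVert^{2}_{\Sigma_{\mathrm{mot}}}
$$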
The demand for high-resolution video content has grown over the years. However, the delivery of high-resolution video is constrained by either the computational resources required for rendering or the network bandwidth for remote transmission. To remedy this limitation, we leverage the eye trackers found alongside existing augmented and virtual reality headsets. We propose applying video super-resolution (VSR) techniques to fuse low-resolution context with regional high-resolution context for resource-constrained consumption of high-resolution content without a perceivable drop in quality. Eye trackers provide us the gaze direction of a user, aiding us in the extraction of the regional high-resolution context. As only pixels that fall within the gaze region can be resolved by the human eye, a large amount of the delivered content is redundant, since we cannot perceive differences in quality beyond the observed region. To generate a visually pleasing frame from the fusion of the high-resolution and low-resolution regions, we study the capability of a deep neural network to transfer the context of the observed region to the other (low-resolution) regions of the current and future frames. We label this task Foveated Video Super-Resolution (FVSR), as we need to super-resolve the low-resolution regions of current and future frames through the fusion of pixels from the gaze region. We propose Cross-Resolution Flow Propagation (CRFP) for FVSR. We train and evaluate CRFP on the REDS dataset on the task of 8x FVSR, i.e., a combination of 8x VSR and the fusion of the foveated region. Departing from the conventional evaluation of per-frame quality using SSIM or PSNR, we propose the evaluation of the past foveated region, measuring the capability of a model to leverage the noise present in eye trackers during FVSR. Code is made available at https://github.com/eugenelet/CRFP.
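A minimal sketch of how a foveated input could be assembled before video super-resolution; this is an illustrative construction under assumed tensor shapes, not the CRFP architecture itself.

```python
import torch
import torch.nn.functional as F

def fuse_foveated(lr_frame, hr_crop, gaze_xy, scale=8):
    """Build a foveated input canvas for a VSR model (illustrative only).

    lr_frame: (1, 3, H, W) low-resolution frame.
    hr_crop:  (1, 3, h, w) high-resolution patch around the gaze point.
    gaze_xy:  (x, y) top-left corner of the patch in high-resolution coordinates.
    Returns a coarse high-resolution canvas where only the gaze region is
    sharp; a VSR model would then propagate that detail to other regions.
    """
    hr = F.interpolate(lr_frame, scale_factor=scale, mode="bicubic",
                       align_corners=False)
    x, y = gaze_xy
    h, w = hr_crop.shape[-2:]
    hr[..., y:y + h, x:x + w] = hr_crop   # overwrite with true high-res pixels
    return hr
```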
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos over various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
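A simplified reading of the sparse-causal attention idea, in which each frame attends only to the first frame and its immediately preceding frame; the shapes and the handling of the first frame are assumptions for illustration, not the paper's implementation.

```python
import torch

def sparse_causal_attention(q, k, v):
    """Sparse-causal attention across frames (simplified sketch).

    q, k, v: (frames, tokens, dim). Each frame's queries attend only to the
    keys/values of the first frame and of its immediately preceding frame,
    instead of full spatio-temporal attention.
    """
    frames = q.shape[0]
    out = []
    for t in range(frames):
        prev = max(t - 1, 0)
        kv_idx = [0, prev]                    # first frame + previous frame
        k_t = torch.cat([k[i] for i in kv_idx], dim=0)   # (2 * tokens, dim)
        v_t = torch.cat([v[i] for i in kv_idx], dim=0)
        attn = torch.softmax(q[t] @ k_t.T / k_t.shape[-1] ** 0.5, dim=-1)
        out.append(attn @ v_t)
    return torch.stack(out)                   # (frames, tokens, dim)
```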
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
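A schematic of the two-stage resynthesis pipeline described above; the function and argument names are placeholders, and the actual P-AVSR model and unit vocoder are large learned components assumed to be provided.

```python
def resynthesize(video_frames, noisy_audio, p_avsr, unit_vocoder):
    """Generalized speech enhancement as resynthesis (schematic, assumed names).

    p_avsr:       callable mapping (video, audio) -> discrete speech units,
                  e.g. cluster indices from a self-supervised speech model.
    unit_vocoder: callable mapping unit sequences -> a clean waveform (P-TTS).
    The reference signal is never reconstructed sample-by-sample; only the
    unit sequence, which encodes content, is carried across the two stages.
    """
    units = p_avsr(video_frames, noisy_audio)   # discrete unit sequence
    return unit_vocoder(units)                  # enhanced, re-synthesized speech
```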
Relation Extraction (RE) has been extended to cross-document scenarios because many relations are not simply described in a single document. This inevitably brings the challenge of efficient open-space evidence retrieval to support the inference of cross-document relations, along with the challenge of multi-hop reasoning over entities and evidence scattered across an open set of documents. To combat these challenges, we propose Mr.CoD, a multi-hop evidence retrieval method based on evidence path mining and ranking with adapted dense retrievers. We explore multiple variants of retrievers to show that evidence retrieval is an essential part of cross-document RE. Experiments on CodRED show that evidence retrieval with Mr.CoD effectively acquires cross-document evidence that essentially supports open-setting cross-document RE. Additionally, we show that Mr.CoD facilitates evidence retrieval and boosts end-to-end RE performance with effective multi-hop reasoning in both closed and open settings of RE.
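A toy sketch of two-hop evidence path mining with a dense retriever; the retriever interface, the entity fields, and the bridging heuristic are all assumptions made for illustration, not Mr.CoD's actual pipeline.

```python
def retrieve_evidence_paths(head, tail, retriever, corpus, top_k=5):
    """Mine candidate two-hop evidence paths for cross-document RE (illustrative).

    Retrieves documents about the head entity, extracts bridge entities they
    mention, then retrieves documents linking each bridge to the tail entity;
    the resulting (head-doc, bridge, tail-doc) triples form candidate evidence
    paths that a reranker would score before relation inference.
    retriever: callable (query, corpus, k) -> list of dicts with an "entities" key.
    """
    paths = []
    for doc_h in retriever(query=head, corpus=corpus, k=top_k):
        for bridge in doc_h["entities"]:
            for doc_t in retriever(query=f"{bridge} {tail}", corpus=corpus, k=top_k):
                if tail in doc_t["entities"]:
                    paths.append((doc_h, bridge, doc_t))
    return paths
```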